Check stride on preallocated output for matmul (fixes #15286) #15288
Easy-click link: #15286.
LGTM. Andreas? (Aside: BLIS, https://github.com/flame/blis, would allow arbitrary strides here.)
@tkelman We allow arbitrary strides, so I'm wondering how much of a speedup BLIS can get on matrices with special strides.
Yeah, it depends on how well optimized the non-unit-stride case is in BLIS relative to the Julia generic gemm. It's probably not commonly benchmarked, but worth trying.
When strides get so big that there's only 1 element per cache line, I suspect the best performance might be achieved by copying (which essentially compacts the data).
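To make the copying idea concrete, here is a minimal sketch in current Julia syntax. The array sizes and the stride of 8 are made up for illustration, and it makes no claim about which variant is faster; it only shows that `copy` turns a strided view into a dense, unit-stride array that gives the same product.

```julia
using LinearAlgebra

# Hypothetical data: sample every 8th row of a parent array, so consecutive
# elements of a column of A are 8 Float64s (64 bytes) apart in memory.
parent = rand(8 * 64, 64)
A = view(parent, 1:8:size(parent, 1), :)   # 64×64 view with stride(A, 1) == 8
B = rand(64, 64)

# Multiply the strided view directly.
C_direct = A * B

# Compact first: copy(A) materializes a dense Array with unit stride.
A_dense = copy(A)
C_copied = A_dense * B

# Both routes compute the same product (up to floating-point rounding).
@assert C_direct ≈ C_copied
```

Whether the extra copy pays off would have to be measured for the strides and sizes in question.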
I believe BLIS does copy for the non-unit-stride case in order to still use optimized SIMD operations, but only one panel at a time rather than the entire array. I should reread their papers and code, though. On typical dgemm, their Haswell kernels are quite competitive with OpenBLAS and MKL.
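A rough sketch of the panel-at-a-time idea, in current Julia syntax. This is not BLIS's actual algorithm or blocking scheme; `panel_matmul` and its `panel` keyword are hypothetical names, and the dense inner multiply is simply delegated to `mul!`. It only illustrates packing one panel of a strided operand into a small contiguous buffer instead of copying the whole array.

```julia
using LinearAlgebra

# Hypothetical helper: compute A*B while packing one k-panel of A at a time.
function panel_matmul(A::AbstractMatrix{T}, B::AbstractMatrix{T}; panel::Int = 16) where {T}
    m, k = size(A)
    k == size(B, 1) || throw(DimensionMismatch("inner dimensions must agree"))
    n = size(B, 2)
    C = zeros(T, m, n)
    buf = Matrix{T}(undef, m, panel)           # reusable contiguous packing buffer
    for k0 in 1:panel:k
        ks = k0:min(k0 + panel - 1, k)
        packed = view(buf, :, 1:length(ks))    # unit-stride window into the buffer
        copyto!(packed, view(A, :, ks))        # pack: strided source -> dense panel
        mul!(C, packed, view(B, ks, :), true, true)   # C += packed * B[ks, :]
    end
    return C
end

# Usage with a made-up strided input (every 8th row of a parent array).
parent = rand(8 * 32, 40)
A = view(parent, 1:8:size(parent, 1), :)       # 32×40, stride(A, 1) == 8
B = rand(40, 24)
@assert panel_matmul(A, B) ≈ A * B
```

The buffer holds `m * panel` elements regardless of how large `A` is, which is the point of packing per panel rather than compacting the entire array.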
Interesting. I'd be surprised if we couldn't someday match them in pure Julia with
Yeah, the basic code generation patterns they use would all translate naturally into Julia-style code generation (nicer, actually, since they're leaning heavily on the C preprocessor), and we could use their kernels as inline LLVM or asm. Some day.
CC @andreasnoack, @machiningcentre